14 research outputs found

    The Impact of RDMA on Agreement

    Full text link
    Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or the other. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness under standard additional assumptions. Comment: Full version of PODC'19 paper, strengthened broadcast algorithm.
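
    The capability these algorithms build on, one-sided reads and writes to a remote region whose owner can dynamically change access permissions, can be made concrete with a small conceptual C++ model. This is only an illustration of the primitive (it does not use a real RDMA API such as ibverbs, and RemoteRegion, grant_write and revoke_write are hypothetical names); it is not the paper's consensus protocol.

        // Conceptual model of the RDMA capability described above: a host-owned
        // memory region whose owner can grant and revoke write permission per
        // remote process. Reads are always allowed in this simplified model.
        #include <cstddef>
        #include <cstdint>
        #include <mutex>
        #include <unordered_set>
        #include <vector>

        class RemoteRegion {
        public:
            explicit RemoteRegion(std::size_t words) : mem_(words, 0) {}

            // Owner-side permission management (the "mechanism to control access").
            void grant_write(int pid)  { std::lock_guard<std::mutex> g(m_); writers_.insert(pid); }
            void revoke_write(int pid) { std::lock_guard<std::mutex> g(m_); writers_.erase(pid); }

            // Remote one-sided operations: a write succeeds only while the issuing
            // process still holds write permission on the region.
            bool write(int pid, std::size_t idx, std::uint64_t val) {
                std::lock_guard<std::mutex> g(m_);
                if (writers_.count(pid) == 0) return false;  // permission was revoked
                mem_.at(idx) = val;
                return true;
            }
            std::uint64_t read(std::size_t idx) {
                std::lock_guard<std::mutex> g(m_);
                return mem_.at(idx);
            }

        private:
            std::mutex m_;
            std::unordered_set<int> writers_;
            std::vector<std::uint64_t> mem_;
        };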

    Efficient Multi-Word Compare and Swap

    Get PDF
    Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in fewer than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS.
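
    For readers unfamiliar with the primitive, the sketch below pins down k-CAS semantics with a deliberately simple lock-based reference implementation. It is not the paper's lock-free algorithm (which needs only k+1 hardware CASes per uncontended call); the names MCas and Entry are illustrative, and all accesses to the target words are assumed to go through this object.

        // k-word compare-and-swap interface with lock-based reference semantics:
        // all k words are updated atomically iff every word still holds its
        // expected value. The paper's contribution is a lock-free version of
        // this operation using k+1 hardware CASes in the uncontended case.
        #include <atomic>
        #include <cstdint>
        #include <mutex>
        #include <vector>

        struct Entry {
            std::atomic<std::uint64_t>* addr;
            std::uint64_t expected;
            std::uint64_t desired;
        };

        class MCas {
        public:
            bool k_cas(const std::vector<Entry>& entries) {
                std::lock_guard<std::mutex> g(m_);      // global lock: reference semantics only
                for (const Entry& e : entries)
                    if (e.addr->load(std::memory_order_relaxed) != e.expected)
                        return false;                   // some word changed: fail atomically
                for (const Entry& e : entries)
                    e.addr->store(e.desired, std::memory_order_relaxed);
                return true;                            // all k words updated together
            }
            std::uint64_t read(std::atomic<std::uint64_t>* addr) {
                std::lock_guard<std::mutex> g(m_);      // reads coordinated through the same lock
                return addr->load(std::memory_order_relaxed);
            }
        private:
            std::mutex m_;
        };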

    The Inherent Cost of Remembering Consistently

    Get PDF
    Non-volatile memory (NVM) promises fast, byte-addressable and durable storage, with raw access latencies of the same order of magnitude as DRAM. But in order to take advantage of the durability of NVM, programmers need to design persistent objects which maintain consistent state across system crashes and restarts. Concurrent implementations of persistent objects typically make heavy use of expensive persistent fence instructions to order NVM accesses, thus negating some of the performance benefits of NVM. This raises the question of the minimal number of persistent fence instructions required to implement a persistent object. We answer this question in the deterministic lock-free case by providing lower and upper bounds on the required number of fence instructions. We obtain our upper bound by presenting a new universal construction that durably implements any object using at most one persistent fence per update operation invoked. Our lower bound states that in the worst case, each process needs to issue at least one persistent fence per update operation invoked.
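
    To make the quantity being counted concrete, here is a hedged sketch of the usual x86 persist pattern (cache-line write-back followed by a store fence), assuming CLWB support and compilation with the corresponding flag. The paper's result concerns how rarely this pattern must be issued (at most one fence per update in its universal construction), not the pattern itself; the helper and struct names are illustrative.

        // Standard x86 persist pattern: write the data, write back the affected
        // cache lines, then issue one store fence so the write-backs are ordered
        // before anything that follows. This fence is the instruction whose count
        // the paper's upper and lower bounds characterize.
        #include <immintrin.h>   // _mm_clwb / _mm_sfence (requires CLWB, e.g. -mclwb)
        #include <cstddef>
        #include <cstdint>

        static void persist(const void* addr, std::size_t len) {
            const std::size_t line = 64;
            auto p = reinterpret_cast<std::uintptr_t>(addr);
            for (std::uintptr_t a = p & ~(line - 1); a < p + len; a += line)
                _mm_clwb(reinterpret_cast<void*>(a));   // write back each covered cache line
            _mm_sfence();                                // the persistent fence being counted
        }

        // Sequential example for illustration: one persist (hence one fence) per update.
        struct DurableCounter {
            std::uint64_t value;   // assumed to live in NVM-mapped memory
            void increment() {
                ++value;
                persist(&value, sizeof(value));
            }
        };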

    Cliff-Learning

    Full text link
    We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster-than-power-law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon. We observe that the degree of cliff-learning reflects the degree of compatibility between the priors of a learning algorithm and the task being learned. Comment: 13 pages.
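
    One way to state the definition hinted at above (this formalization is our reading of "faster than a power law", not wording from the paper; L(n), c and alpha are assumed symbols): let L(n) denote downstream loss as a function of downstream dataset size n. Then

        \[
          \log L(n) \;=\; \log c \;-\; \alpha \log n
          \qquad \text{(a pure power law $L(n) = c\,n^{-\alpha}$ is a straight line in log-log coordinates),}
        \]
        \[
          \frac{d^{2} \log L}{d(\log n)^{2}} \;<\; 0
          \qquad \text{(cliff-learning: the log-log curve is concave, i.e. loss falls faster than any local power law).}
        \]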

    Log-Free Concurrent Data Structures

    Get PDF
    Non-volatile RAM (NVRAM) makes it possible for data structures to tolerate transient failures, assuming however that programmers have designed these structures such that their consistency is preserved upon recovery. Previous approaches are typically transactional and inherently make heavy use of logging, resulting in implementations that are significantly slower than their DRAM counterparts. In this paper, we introduce a set of techniques aimed at lock-free data structures that, in the large majority of cases, remove the need for logging (and costly durable store instructions) both in the data structure algorithm and in the associated memory management scheme. Together, these generic techniques enable us to design what we call log-free concurrent data structures, which, as we illustrate on linked lists, hash tables, skip lists, and BSTs, can provide several-fold performance improvements over previous transaction-based implementations, with overheads on the order of milliseconds for recovery after a failure. We also highlight how our techniques can be integrated into practical systems, by presenting a durable version of Memcached that maintains the performance of its volatile counterpart.
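
    The flavor of these techniques can be illustrated with a hedged sketch of a durable, log-free push on a Treiber stack: flush the new node before it becomes reachable, link it with a CAS, then flush the updated link. This is a simplification in the spirit of the abstract, not the paper's linked-list, hash-table, skip-list or BST algorithms; real designs additionally have concurrent readers help flush a freshly installed link before depending on it.

        // Log-free durable insertion on a Treiber stack: the new node is written
        // and flushed BEFORE it becomes reachable via the CAS, and the updated
        // head pointer is flushed afterwards, so a crash at any point leaves a
        // consistent structure that recovery can use without replaying a log.
        #include <immintrin.h>   // _mm_clwb / _mm_sfence (requires CLWB support)
        #include <atomic>
        #include <cstdint>

        static void persist(const void* p) {
            _mm_clwb(const_cast<void*>(p));   // write back the cache line holding *p
            _mm_sfence();                     // order the write-back before later stores
        }

        struct Node {                         // assumed to fit in one cache line
            std::uint64_t value;
            Node*         next;
        };

        struct DurableStack {
            std::atomic<Node*> head{nullptr};

            void push(Node* n, std::uint64_t v) {    // n is assumed pre-allocated in NVM
                n->value = v;
                Node* h = head.load();
                do {
                    n->next = h;
                    persist(n);                      // node contents durable before linking
                } while (!head.compare_exchange_weak(h, n));
                persist(&head);                      // make the new link durable
            }
        };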

    FloDB: Unlocking Memory in Persistent Key-Value Stores

    Get PDF
    Log-structured merge (LSM) data stores make it possible to store and process large volumes of data while maintaining good performance. They mitigate the I/O bottleneck by absorbing updates in a memory layer and transferring them to the disk layer in sequential batches. Yet, the LSM architecture fundamentally requires elements to be in sorted order. As the amount of data in memory grows, maintaining this sorted order becomes increasingly costly. Contrary to intuition, existing LSM systems could actually lose throughput with larger memory components. In this paper, we introduce FloDB, an LSM memory component architecture which allows throughput to scale on modern multicore machines with ample memory sizes. The main idea underlying FloDB is essentially to bootstrap the traditional LSM architecture by adding a small in-memory buffer layer on top of the memory component. This buffer offers low-latency operations, masking the write latency of the sorted memory component. Integrating this buffer into the classic LSM memory component to obtain FloDB is not trivial and requires revisiting the algorithms of the user-facing LSM operations (search, update, scan). FloDB's two layers can be implemented with state-of-the-art, highly concurrent data structures. This way, as we show in the paper, FloDB eliminates significant synchronization bottlenecks in classic LSM designs, while offering a rich LSM API. We implement FloDB as an extension of LevelDB, Google's popular LSM key-value store. We compare FloDB's performance to that of state-of-the-art LSMs. In short, FloDB's performance is up to one order of magnitude higher than that of the next best-performing competitor in a wide range of multi-threaded workloads.
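
    A hedged sketch of the two-layer memory component described above: a small, low-latency (unsorted) buffer absorbs updates, reads consult the buffer before the sorted layer, and a drain step migrates entries into the sorted layer. Class and method names are illustrative, and the stand-in structures (a hash map and an ordered map under locks) take the place of FloDB's highly concurrent data structures and its scan protocol.

        // Two-layer LSM memory component in the spirit of FloDB: puts hit a small
        // unsorted buffer at low latency; a background drain moves entries into
        // the sorted layer; gets check the buffer first, then the sorted layer.
        #include <map>
        #include <mutex>
        #include <optional>
        #include <string>
        #include <unordered_map>

        class TwoLayerMemTable {
        public:
            void put(const std::string& k, const std::string& v) {
                std::lock_guard<std::mutex> g(buf_m_);
                buffer_[k] = v;                            // low-latency, unsorted layer
            }
            std::optional<std::string> get(const std::string& k) {
                {
                    std::lock_guard<std::mutex> g(buf_m_);
                    auto it = buffer_.find(k);
                    if (it != buffer_.end()) return it->second;
                }
                std::lock_guard<std::mutex> g(sorted_m_);
                auto it = sorted_.find(k);
                if (it == sorted_.end()) return std::nullopt;
                return it->second;
            }
            void drain() {                                 // background: buffer -> sorted layer
                std::scoped_lock g(buf_m_, sorted_m_);     // hold both so readers never miss a key
                for (auto& kv : buffer_) sorted_[kv.first] = std::move(kv.second);
                buffer_.clear();
            }
        private:
            std::mutex buf_m_, sorted_m_;
            std::unordered_map<std::string, std::string> buffer_;   // buffer layer
            std::map<std::string, std::string> sorted_;             // sorted memory component
        };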

    Frugal Byzantine Computing

    Get PDF
    Traditional techniques for handling Byzantine failures are expensive: digital signatures are too costly, while using 3f+1 replicas is uneconomical (f denotes the maximum number of Byzantine processes). We seek algorithms that reduce the number of replicas to 2f+1 and minimize the number of signatures. While the first goal can be achieved in the message-and-memory model, accomplishing the second goal simultaneously is challenging. We first address this challenge for the problem of broadcasting messages reliably. We study two variants of this problem, Consistent Broadcast and Reliable Broadcast, typically considered very close. Perhaps surprisingly, we establish a separation between them in terms of signatures required. In particular, we show that Consistent Broadcast requires at least 1 signature in some execution, while Reliable Broadcast requires O(n) signatures in some execution. We present matching upper bounds for both primitives within constant factors. We then turn to the problem of consensus and argue that this separation matters for solving consensus with Byzantine failures: we present a practical consensus algorithm that uses Consistent Broadcast as its main communication primitive. This algorithm works for n = 2f+1 and avoids signatures in the common case, properties that have not been simultaneously achieved previously. Overall, our work approaches Byzantine computing in a frugal manner and motivates the use of Consistent Broadcast, rather than Reliable Broadcast, as a key primitive for reaching agreement.

    Fast and Robust Memory Reclamation for Concurrent Data Structures

    Get PDF
    In concurrent systems without automatic garbage collection, it is challenging to determine when it is safe to reclaim memory, especially for lock-free data structures. Existing concurrent memory reclamation schemes are either fast but intolerant of process delays, robust to delays but with high overhead, or both robust and fast but narrowly applicable. This paper proposes QSense, a novel concurrent memory reclamation technique. QSense is a hybrid technique with a fast path and a fallback path. In the common case (without process delays), a high-performing memory reclamation scheme is used (fast path). If process delays block memory reclamation through the fast path, a robust fallback path is used to guarantee progress. The fallback path uses hazard pointers, but avoids their notorious need for frequent and expensive memory fences. QSense is widely applicable, as we illustrate through several lock-free data structure algorithms. Our experimental evaluation shows that QSense has an overhead comparable to the fastest memory reclamation techniques, while still tolerating prolonged process delays.
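
    Since the fallback path builds on hazard pointers, the hedged sketch below shows the classic protect/retire pattern (generic hazard pointers with one slot per thread, assuming a single node type; names are illustrative). It deliberately keeps the full fence on every protect call; avoiding exactly that per-access fence in the common case is what QSense's fast path is for.

        // Classic hazard-pointer protect/retire pattern, the building block of the
        // fallback path described above. A node retired after being unlinked is
        // freed only once no thread has it announced in a hazard slot.
        #include <array>
        #include <atomic>
        #include <cstddef>
        #include <vector>

        constexpr std::size_t kMaxThreads = 64;
        std::array<std::atomic<void*>, kMaxThreads> hazard{};   // one slot per thread

        // Announce intent to access *src; retry until the announcement is visible
        // and the pointer has not changed underneath us.
        template <typename T>
        T* protect(std::size_t tid, const std::atomic<T*>& src) {
            T* p;
            do {
                p = src.load(std::memory_order_acquire);
                hazard[tid].store(p, std::memory_order_seq_cst);   // publish before re-checking
            } while (p != src.load(std::memory_order_seq_cst));    // re-validate the pointer
            return p;
        }

        void clear(std::size_t tid) { hazard[tid].store(nullptr, std::memory_order_release); }

        // Defer freeing a retired (already unlinked) node until it is unprotected.
        thread_local std::vector<void*> retired;

        template <typename T>
        void retire(T* node) {
            retired.push_back(node);
            if (retired.size() < 2 * kMaxThreads) return;           // amortize the scan
            std::vector<void*> still_in_use;
            for (void* r : retired) {
                bool hazardous = false;
                for (auto& h : hazard)
                    if (h.load(std::memory_order_acquire) == r) { hazardous = true; break; }
                if (hazardous) still_in_use.push_back(r);
                else delete static_cast<T*>(r);                     // safe: no one protects it
            }
            retired.swap(still_in_use);
        }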

    Honeycomb: ordered key-value store acceleration on an FPGA-based SmartNIC

    Full text link
    In-memory ordered key-value stores are an important building block in modern distributed applications. We present Honeycomb, a hybrid software-hardware system for accelerating read-dominated workloads on ordered key-value stores that provides linearizability for all operations, including scans. Honeycomb stores a B-Tree in host memory, executes SCAN and GET on an FPGA-based SmartNIC, and executes PUT, UPDATE and DELETE on the CPU. This approach enables large stores and simplifies the FPGA implementation, but raises the challenge of data access and synchronization across the slow PCIe bus. We describe how Honeycomb overcomes this challenge with careful data structure design, caching, request parallelism with out-of-order request execution, wait-free read operations, and batched synchronization between the CPU and the FPGA. For read-heavy YCSB workloads, Honeycomb improves the throughput of a state-of-the-art ordered key-value store by at least 1.8x. For scan-heavy workloads inspired by cloud storage, Honeycomb improves throughput by more than 2x. Cost-performance, which is more important for large-scale deployments, improves by at least 1.5x on these workloads.
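
    A purely conceptual sketch of the operation split described above: read-only requests (GET, SCAN) are routed to an accelerator-side handler, while mutating requests (PUT, UPDATE, DELETE) go to the host CPU. All types and names here are illustrative assumptions; this is not Honeycomb's host/FPGA interface, and it says nothing about the PCIe batching, caching or wait-free read machinery the paper describes.

        // Conceptual routing of the read/write split over a shared ordered store:
        // reads are eligible for NIC offload, writes stay on the host CPU.
        #include <string>
        #include <variant>

        struct Get    { std::string key; };
        struct Scan   { std::string lo, hi; };
        struct Put    { std::string key, value; };
        struct Update { std::string key, value; };
        struct Delete { std::string key; };
        using Request = std::variant<Get, Scan, Put, Update, Delete>;

        struct KvService {
            // Read-only operations are the ones handled by the accelerator path.
            static bool offload_to_nic(const Request& r) {
                return std::holds_alternative<Get>(r) || std::holds_alternative<Scan>(r);
            }
            template <typename NicPath, typename CpuPath>
            void dispatch(const Request& r, NicPath&& nic, CpuPath&& cpu) {
                if (offload_to_nic(r)) nic(r);   // SCAN/GET on the SmartNIC side
                else                   cpu(r);   // PUT/UPDATE/DELETE on the CPU side
            }
        };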

    The Disclosure Power of Shared Objects

    Get PDF
    Shared objects are the means by which processes gather and exchange information about the state of a distributed system. Objects that disclose more information about the system, and thus provide a more centralized view, are more desirable. In this paper, we propose the schedule reconstruction (SR) problem as a new metric for the disclosure power of shared memory objects. In schedule reconstruction, processes take steps which are interleaved to form a schedule; each process needs to be able to reconstruct the schedule up to its last step. We show that objects can be ranked in a hierarchy according to their ability to solve SR. In this hierarchy, stronger objects can implement weaker objects via an SR-based universal construction. We identify a connection between SR and consensus and prove that SR is at least as hard as consensus. Perhaps surprisingly, we show that objects that are powerful in solving consensus, such as compare-and-swap, are not always powerful in their ability to solve SR.